1. Introduction

Many papers have been written about limits to computer performance. Some of these have 
been based on quantum-mechanical [8], thermal [8], or size [10] limits. More recently circuit 
and system performance limits on VLSI interconnections and packaging have been 
investigated [1]. This work refines integration and packaging performance limits specifically in 
the context of computer systems. In particular, limits of computer performance under various 
packaging, architectural, organizational, and design techniques (e.g., gate-array vs. custom) are 
explored. 

2. Is 3-Dimensional Packaging Needed? 

There are four components to delay when driving a signal off a package. First, there is a fixed 
driver delay dependent on the chip's technology for going off-chip. Second, once off-chip, there 
is also a delay proportional to the signal length traveled. This transmission delay is dependent on 
the voltage swings, transmission-line environment, and the actual distance traveled which is a 
function of the packaging technology. (These two delays are also present on-chip, but with 
smaller overall delay.) Third, there is a fixed receiver delay. Fourth, there is a delay due to 
clock skew between the transmitter and receiver. This can be a significant percentage of the total 
interconnection delay in high performance systems. In the remainder of the paper, interconnect 
delay will be used to describe the sum of these four components. 
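
As a minimal sketch of this definition, the four components can simply be summed. The function
name, the parameter names, and the numeric values below are illustrative assumptions, not
measurements of any particular technology.

```python
def interconnect_delay_ns(driver_ns, ns_per_inch, length_in, receiver_ns, skew_ns):
    """Sum the four off-package delay components described above (all in ns)."""
    # Transmission delay is the only term proportional to the distance traveled.
    transmission_ns = ns_per_inch * length_in
    return driver_ns + transmission_ns + receiver_ns + skew_ns


# Hypothetical values, chosen only to show the arithmetic.
total = interconnect_delay_ns(driver_ns=2.0, ns_per_inch=0.25, length_in=4.0,
                              receiver_ns=1.0, skew_ns=0.5)
print(f"interconnect delay: {total:.2f} ns")
```

Only the transmission term shrinks when the wire gets shorter; the driver, receiver, and skew
terms stay fixed in this sketch.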

Since the interconnection delay has a term proportional to the distance traveled, the
observation has been made that the fastest computers would be limited by their size. One obvious
limit is that the radius of the machine should be smaller than the distance traveled in a clock
cycle by a signal at the speed of light. More specifically, if some portion of a machine must
communicate with other parts of the machine in one cycle, the size of the machine is limited by
the speed of transmission in the packaging media used in the machine. This has led to further
observations that the shape enclosing the maximum volume within a given radius is a sphere, and
that the fastest machines must therefore be spheres. This puts a high premium on the development
of 3-dimensional or "volumetric" packaging techniques.

The drive for volumetric density has significant implications for computer packaging. Tall 
heat sinks waste machine volume compared to the volume requirements of only the chips and the 
boards. Therefore to maximize packing efficiency, parts must be stacked on top of each other 
leaving no room for heat sinks. Thus the search for volumetric packaging techniques leads to 
liquid-based immersion cooling techniques [4], where components can be stacked together very
closely. 
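
The clock-cycle size limit described above can be made concrete with a short calculation. This is
a minimal sketch: the function name, the velocity_fraction and fixed_delay_ns parameters, and the
example numbers are assumptions for illustration. Only the rule that a signal must cover the
distance within one cycle comes from the text.

```python
C_IN_PER_NS = 11.8  # speed of light in vacuum, roughly 11.8 inches per nanosecond


def clock_time_radius_in(cycle_ns, velocity_fraction=1.0, fixed_delay_ns=0.0):
    """Largest distance (inches) a signal can cover within one clock cycle.

    velocity_fraction: propagation speed in the packaging medium as a fraction
                       of the speed of light (assumed parameter).
    fixed_delay_ns:    time lost to driver, receiver, and skew (assumed parameter).
    """
    usable_ns = max(cycle_ns - fixed_delay_ns, 0.0)
    return C_IN_PER_NS * velocity_fraction * usable_ns


# Hypothetical machine: 10 ns cycle, signals at half the speed of light,
# 4 ns of fixed delay per transfer.
print(f"{clock_time_radius_in(10.0, 0.5, 4.0):.1f} inches")
```

In this sketch the fixed delays shrink the reachable radius as much as a slower medium does.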

In reality however, the need to make machines in the shape of spheres is only true if the 
machine consists of uniformly distributed interconnections between random points in the 
machine. This is not completely true for any machine, and is mostly untrue for most machines. 

For example, consider a machine in the shape of a sphere. Imagine the machine is pipelined,
and has independent functional units for floating-point operations. In this case machine
performance would not be degraded if we "pulled" the floating-point pipeline out of the sphere
(see Figure 1). This is because the data in a floating-point functional unit proceeds from
pipestage to pipestage without communicating with the rest of the machine except at its input
and its output.* Thus the machine is not limited to a sphere when the interconnections are
structured and local instead of random.




Figure 1: Machine structure vs. packaging required 

* We can clock the latches outside of the clock-time sphere in several ways. If the clock can be
delayed with good control of skew, we could add a pipestage of delay to the clocks of the latches
one clock time away, etc. Another method would be to use clock forwarding techniques, where the
clock travels along with the data.

Another important example of machine structure is provided by instruction and data caches.
For caches of reasonable size, most instruction and data references are satisfied by the caches
without recourse to main memory. Since cache misses are the infrequent case, access beyond the
caches could be made to operate at a slower cycle time than the core of the machine without
performance degradation. (Of course the overall access time and latency of the cache miss
should still be minimized.) Similarly, if the CPU is entirely contained on a single chip (not
including caches, floating-point, or MMU), many signal paths are entirely contained on the CPU
chip. Some signal distribution frequencies based on the MultiTitan CPU chip [6] are given in
Table 1. This shows the very strong tendency towards interconnection "locality of reference."
In fact, this interconnection locality of reference is a corollary of Rent's rule [9]. This is because
as the number of gates on a chip increases, the percentage of nets that cross the chip boundary
gets smaller and smaller. (This is also true for blocks within a chip.)

Unit                   Total nets   External nets   Percent external

1 bit adder                    41              13              31.7%
32-bit adder                1,120             106               9.5%
CPU datapath               11,263             148               1.3%
CPU + LI ICache            62,745             136               0.2%
Above + FPU + MMU        160,000+              80             <0.05%

Table 1: Internal and external nets vs. integration
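
The percentages in Table 1 follow directly from the net counts. The sketch below recomputes them
for the rows with exact totals; the last row is left out because its total is given only as a
lower bound. The variable and function names are illustrative.

```python
# (unit, total nets, external nets) as listed in Table 1.
TABLE_1 = [
    ("1 bit adder", 41, 13),
    ("32-bit adder", 1120, 106),
    ("CPU datapath", 11263, 148),
    ("CPU + LI ICache", 62745, 136),
]


def percent_external(total_nets, external_nets):
    """Share of nets that cross the unit boundary, in percent."""
    return 100.0 * external_nets / total_nets


for unit, total, external in TABLE_1:
    print(f"{unit:<16} {percent_external(total, external):5.1f}%")
```

The external net count stays within a factor of about ten while the total grows by more than a
thousand, which is why the external share falls so quickly.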

Thus, based on interconnection locality of reference and the structure inherent in computers, 
only a small portion of the machine is limited by speed of light considerations. Since signals 
travel at least Bin per nanosecond on most types of PC boards, this gives us a fair amount of 
room to work with. In particular, 3-D volumetric packaging techniques are unnecessary if the 
core of the machine fits within a circular instead of spherical clock-time radius, or if parts of the 
machine can extend beyond this radius. 

Unfortunately this is not the whole story. If only 1% of the signals are not integrated into the
core processor but they are on the critical path of the machine, the system performance will still
be low. In particular, machines which contain no cache, such as Cray machines [2, 4], have
problems in this respect. By not taking advantage of the locality of signal travel gained from the
locality of data reference in a cache, main memory must be included in the core of the machine,
greatly increasing the size of the core machine and the importance of advanced packaging
techniques.



3. The Relative Importance of Signal Integration 

At each level of integration the best performance is obtained by integrating critical paths of a 
machine onto the same chip. In addition, implementation of wide busses between functional 
units is simplified if both functional units reside on the same chip. Even if the best choices are 
made for the co-integration of circuits at a given density of integration, the delays in packaging 
will create limits to machine performance. Increased pipelining can partially compensate for 
packaging delays. For example, a pipestage could be added for each chip crossing, such that in 
the limit, the cycle time is equal to the maximum interconnection delay between chips. (In fact, 
the longest wires could be further pipelined by inserting additional latch chips, giving a cycle 
time closer to the average wire length.) But deeper pipelining also results in more cycles lost 
due to breaks in the pipeline as a result of branches and data dependencies. In the limit, the 
performance of the machine will be determined by the number of chip crossings required by the 
data and control (i.e., branch) dependencies present in a program. 

To get a better understanding of these limits, we simulated machines with infinitely fast
gate delays while varying the density of integration. This provided an upper bound on processor
performance. Eight programs were simulated with a parameterizable machine architecture and
compilation system [7]. This system optimizes, reorders, and simulates program code based on
the pipeline characteristics specified in a configuration file. The programs simulated were our C
compiler, a PC board router, Linpack, the Livermore Loops, a timing verifier, the Hennessy
Stanford benchmarks, Whetstone, and the Unix Yacc program.

The latencies of functional units in a series of machines were estimated based on the number
of chip crossings likely for each level of integration. Table 2 lists some assumptions for typical
machines at each level of integration. For example, in an LSI environment a 32-bit adder might
be contained on a single chip. But to use the results of an addition in another computation, we
might have to leave the adder chip and go through a chip containing result forwarding (i.e.,
bypass) logic. Thus the total add delay in this technology involves crossing from the adder to the
bypass back to the adder, for a total of two interconnection delays. In a VLSI single-chip CPU,
the integer adder and bypass should be on the same chip. Since we are assuming gate delays are
zero, the integer add delay on a VLSI chip is zero. However, the floating-point unit might
contain separate chips for addition/subtraction, multiplication, division, and floating-point
registers and bypassing. Then two chip crossings would be required per floating-point operation.
The only other chip crossings for a VLSI CPU without on-chip caches would be those to fetch
instructions or to fetch data during load instructions. Finally, consider a ULSI machine which has
its first-level instruction and data caches on-chip along with the MMU and floating-point
support. It would only incur off-chip interconnection delays for cache misses. Two ULSI machines
were considered, one with small on-chip caches and one with large on-chip caches.



Characteristic            LSI     VLSI   ULSI#1   ULSI#2

Chip crossings per:
  Integer add               2        0        0        0
  Load + addr gen           4        2        0        0
  FP ops                    8        2        0        0
  branch on CPU reg         4        0        0        0

1st level cache:
  On-chip or off          off      off       on       on
  Instruction             4KB     64KB      1KB      8KB
  Data                    8KB     64KB      2KB     16KB

2nd level cache:
  Mixed                  none     none      4MB      4MB
  Crossings on miss                          20       20

Table 2: Model parameters for each level



At each level of integration, a family of machines with different pipeline depths was
simulated. For example, consider the pipelines possible with an LSI technology (see Table 3). The
machine cycle time was varied between different multiples of the chip-to-chip interconnection
delay. For example, a machine with a cycle time equal to a single chip crossing must have
latches on each chip input pin. A machine pipelined this heavily will have many pipeline stages
in technologies with low levels of integration. Similarly, a machine with a cycle time four times
that of the interconnection delay has a relatively small number of pipeline stages even at lower
levels of integration.
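
The relationship between chip crossings and cycle latency can be sketched as a ceiling division.
This rule is an inference, not something the text states: it assumes an operation occupies a
whole number of cycles and at least one cycle. Applied to the LSI column of Table 2, it
reproduces the operation latencies listed in Table 3. The names below are illustrative.

```python
import math

# Chip crossings per operation, LSI column of Table 2.
LSI_CROSSINGS = {
    "Integer add": 2,
    "Load + addr gen": 4,
    "FP ops": 8,
    "branch on CPU reg": 4,
}

# Cycle times in units of one chip-to-chip interconnection delay.
CYCLE_TIMES = [4, 2, 1, 0.5]


def cycle_latency(crossings, cycle_time):
    """Cycles needed to cover `crossings` interconnection delays.

    Assumption: latency is rounded up to whole cycles, with a minimum of one.
    """
    return max(1, math.ceil(crossings / cycle_time))


for op, crossings in LSI_CROSSINGS.items():
    row = [cycle_latency(crossings, t) for t in CYCLE_TIMES]
    print(f"{op:<18}", *row)
```

Halving the cycle time below the crossing count doubles the latency in cycles, so the latency
measured in interconnection delays stops improving.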

                          Cycle time in interconnection delays
Characteristic               4        2        1      0.5

Cycle latency per:
  Integer add                1        1        2        4
  Load + addr gen            1        2        4        8
  FP ops                     2        4        8       16
  branch on CPU reg          1        2        4        8

1st level cache:
  Cycles on miss             5       10       20       40

Table 3: Different pipelining possible in LSI

Figure 2 shows the results of simulations based on this model. The curves drawn are the
average performance over all benchmarks. Figure 2 shows that for each level of integration, the
interconnect delay sets a limit to machine performance. This is because although the cycle time
may be kept small, the extra pipestages that were added hurt performance because of data
dependencies and their resulting breaks in the pipeline. Each curve for a given level of technology
saturates when most of the performance is already being lost in pipeline breaks. This results in
little or no additional performance when the number of pipestages is increased by decreasing the
cycle time (i.e., for points on the curve farther to the right). For a given cycle time (e.g., equal to
one interconnect delay), the effects of increasing levels of integration can be seen by moving
from a lower-integrated curve to a higher-integrated curve. Finally, if one interconnect delay
(i.e., one on the lower axis) is taken to be 10ns, the relative performance figures correspond
roughly to processor MIPS.
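
The saturation described above can be illustrated with a toy model. This is not the simulator of
[7]: it assumes one instruction issues per cycle, that a fixed fraction of instructions must wait
for the full latency of the previous result, and that every operation costs the same number of
chip crossings. The function names and all parameter values are hypothetical.

```python
import math


def time_per_instruction(cycle_time, crossings, dependent_fraction):
    """Average time per instruction, in interconnect delays (toy model)."""
    latency_cycles = max(1, math.ceil(crossings / cycle_time))
    # A dependent instruction waits for the extra cycles of its producer.
    stall_cycles = dependent_fraction * (latency_cycles - 1)
    return cycle_time * (1 + stall_cycles)


def mips(cycle_time, crossings, dependent_fraction, delay_ns=10.0):
    """Instructions per microsecond when one interconnect delay is delay_ns."""
    time_ns = time_per_instruction(cycle_time, crossings, dependent_fraction) * delay_ns
    return 1000.0 / time_ns


# Hypothetical workload: 4 crossings per operation, half the instructions dependent.
for t in [4, 2, 1, 0.5, 0.25]:
    print(f"cycle time {t:>4}: {mips(t, crossings=4, dependent_fraction=0.5):5.1f} MIPS")
```

With these made-up inputs the result climbs from 25 MIPS at a cycle time of four delays toward a
ceiling of 50 MIPS, since the time per instruction can never drop below the dependent fraction
times the crossing count. Reducing the number of crossings, which is what higher integration
does, is the only way to raise that ceiling in this model.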

Packaging also affects the limit of machine performance. However, most packaging
technologies are only a second-order effect compared to the level of integration. This is because
better packaging can reduce the component of interconnection delay due to wire length, but the
packaging technology generally does not affect the fixed delay component due to the off-chip
driver. (Certain wafer-scale integration technologies can reduce this in the limit, however.)
Thus variation in packaging technology only allows system performance to vary from one
plateau of performance to about the next or previous plateau, depending on whether it is better or
worse than average packaging.

Note that the x-axis of Figure 2 is specified in units of interconnection delay. Thus systems
with high performance signal transmission (e.g., unidirectional single-driver single-receiver
terminated transmission lines) will have significantly higher performance plateaus than systems
with poor signal transmission (e.g., multi-driver multi-receiver unterminated TTL traces).

Finally, there are a number of other factors which affect the importance of packaging in
system design besides integration and whether the machine is cache-based. Some second-order
effects that can affect the amount of circuitry integrated on a chip by about a factor of two each
are whether the machine is a CISC or a RISC machine, whether the design is gate array or full
custom, and whether the implementation is complex or is simple and straightforward. Thus
advanced packaging would be least important for a straightforward full-custom implementation of
a RISC, due to its economical use of chip real estate that allows maximum integration of
functional units at a given level of technology. Advanced packaging will be more important for
complex gate-array implementations of CISC machines.

4. The Advent of Fully-Integrated Processors

The most interesting plateaus in Figure 2 are the top two. At these levels a CPU, FPU, MMU,
instruction and data cache have all been integrated onto a single chip. These two plateaus are
significantly higher than the others because at this level of integration most instructions can
execute entirely on chip. Thus for most instructions, they are fully-integrated processors.

Fully-integrated processors have many benefits. First, since only cache miss and write-back
traffic crosses the pins, for reasonable on-chip cache sizes the pins can operate at a significantly
lower frequency than the on-chip clock rate. For example, the off-chip cache refill could occur
with a 10ns clock cycle, while the on-chip clock period could be 2.5ns. If the 2.5ns internal
clock is produced by frequency synthesis from the external 10ns clock via a phase-locked loop,
no signals with fast edge rates or cycle times under 10ns need appear at the chip's pins. This
allows rather conventional packaging to be used compared to that required to support a machine
built from many chips operating with 2.5ns chip crossings.
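
The clock relationship in this example works out as follows. The sketch assumes the synthesized
internal clock is an integer multiple of the external clock, which is a simplification; the
function names are illustrative.

```python
def period_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def pll_multiplier(external_period_ns, internal_period_ns):
    """Frequency multiplication factor needed from the phase-locked loop.

    Assumption: only integer multiplication factors are allowed.
    """
    ratio = external_period_ns / internal_period_ns
    if abs(ratio - round(ratio)) > 1e-9:
        raise ValueError("internal clock is not an integer multiple of the external clock")
    return round(ratio)


external_ns, internal_ns = 10.0, 2.5  # periods from the example in the text
print(f"pins: {period_to_mhz(external_ns):.0f} MHz, "
      f"on-chip: {period_to_mhz(internal_ns):.0f} MHz, "
      f"multiplier: {pll_multiplier(external_ns, internal_ns)}")
```

The pins see a 100 MHz clock while the core runs at four times that rate.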

5. The New Technology Paradox 

New technologies with very small gate delays are initially available only with relatively low 
levels of integration. As more and more circuits of a new technology are built, technology 
travels down an integration learning curve of increasing density. For example, the first available 
GaAs chips had significantly fewer gates per chip than CMOS or ECL chips in the same time 
frame. 

Thus in order to exploit emerging technologies, advanced packaging technology must be used
to partially ameliorate the effects of the lower density available in the emerging technology. For
example, the Cray-3 needs to attain very dense packaging because its GaAs chips only contain
200-400 gates of logic [5]. Depending on the chip technology advantages and the packaging
technology available, the overall system performance of a machine built with an emerging
technology at a low level of integration could still be less than that obtainable with a highly or
fully-integrated version of the current mature technology. This discourages the widespread use of
emerging technologies, and slows their progress along technology learning curves. An example
of this is Josephson junction technology. Although IBM was able to build a prototype Josephson
junction-based system in the late 1970s with average gate delays of 44ps [3], the fastest overall
cycle time for a system of four chips on two cards was 3.7ns [11]. This is a machine cycle time
of more than 80 gate delays (3.7ns / 44ps), which is far more than the 8 and 4 gate delays per
cycle obtained in the Cray-1 and Cray-2 [4].
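
The comparison above can be checked with a few lines of arithmetic. The figures are the ones
quoted in the text; the function name is illustrative.

```python
def gate_delays_per_cycle(cycle_time_ns, gate_delay_ps):
    """Number of average gate delays that fit in one machine cycle."""
    return cycle_time_ns * 1000.0 / gate_delay_ps


josephson = gate_delays_per_cycle(cycle_time_ns=3.7, gate_delay_ps=44.0)
print(f"Josephson prototype: {josephson:.0f} gate delays per cycle")

# Gate delays per cycle quoted in the text for the Cray machines.
for name, cray in [("Cray-1", 8), ("Cray-2", 4)]:
    print(f"{name}: {cray} gate delays per cycle, ratio {josephson / cray:.1f}x")
```

Measured this way, the prototype spends roughly ten to twenty times more gate delays per cycle
than the Cray machines, which indicates how much of its cycle goes to something other than gates.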

6. Conclusions 

Integration and packaging set limits to processor performance. If cache-based architectures
are used, only a relatively small core of a machine is limited by interconnection transmission
delays. If large levels of integration are used to build cache-based machines, the need for
volumetrically dense packaging is practically eliminated. Machines that contain a CPU, FPU,
MMU, and instruction and data caches all on one die are fully-integrated processors from the
standpoint of most instructions. Fully-integrated processors can have modest electrical signal
I/O requirements because the frequency of signals crossing their pins can be several times lower
than the on-chip clock frequency. Finally, in order to exploit emerging technologies
without high levels of integration, advanced volumetric packaging techniques will still be very
important.